Contributors: Reed Dalton, Corey Haines, Alex Jansen, McKenna Karnes
Given a single survey question presented to visitors of a particular website, what percentage of people who truthfully answered the question selected yes?
* There are two types of clickers: Random Clickers (RC) and Truthful Clickers (TC)
* There are two answers to the survey: yes and no
* The expected fraction of Random Clickers (RC) is 0.3
* The expected fraction of Truthful Clickers (TC) is 0.7
* The probability that a Random Clicker (RC) selects yes is 0.5
* The probability that a Random Clicker (RC) selects no is 0.5
The results of the survey were such that 65% said Yes and 35% said No.
# Observed survey results
prob.yes = 0.65
prob.no = 0.35
# Random clickers answer uniformly at random
prob.yes.given.RC = 0.5
prob.no.given.RC = 0.5
# Expected fractions of each clicker type
prob.TC = 0.7
prob.RC = 0.3
# Law of total probability, rearranged to solve for P(yes | TC)
prob.yes.given.TC = (prob.yes - (prob.RC * prob.yes.given.RC))/prob.TC
# Sanity check: recombining the pieces should recover P(yes) = 0.65
prob.yes = prob.TC * prob.yes.given.TC + prob.RC * prob.yes.given.RC
The percentage of Truthful Clickers who answered “Yes” to the simple survey is approximately 71.43% (0.7142857).
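The algebra behind this answer is the law of total probability rearranged for P(yes | TC); a quick Python check of the same arithmetic (mirroring the R variables above):

```python
# P(yes) = P(TC)*P(yes|TC) + P(RC)*P(yes|RC)  -- law of total probability
p_yes, p_rc, p_tc = 0.65, 0.3, 0.7
p_yes_given_rc = 0.5

# Rearranged to isolate the truthful clickers:
p_yes_given_tc = (p_yes - p_rc * p_yes_given_rc) / p_tc
print(round(p_yes_given_tc, 7))  # -> 0.7142857
```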
Medical test for disease using Bayes Formula.
* The sensitivity is about 0.993. That is, if someone has the disease, there is a probability of 0.993 that they will test positive.
* The specificity is about 0.9999. This means that if someone doesn’t have the disease, there is probability of 0.9999 that they will test negative.
* About 0.0025% of all people in the general population have the disease.
If someone tests positive for the disease, what is the probability that they are actually infected?
# Prior probabilities: prevalence of the disease in the population
pop.infected = 0.000025
pop.clean = 0.999975
# Sensitivity: P(positive | infected) and its complement
infected.pos = 0.993
infected.neg = 0.007
# False-positive rate and specificity: P(negative | clean)
clean.pos = 0.0001
clean.neg = 0.9999
# Bayes' rule: P(infected | positive)
infected.test.pos = (infected.pos*pop.infected)/(infected.pos*pop.infected + clean.pos*pop.clean)
# Express as a percentage
infected.test.pos = infected.test.pos * 100
If someone tested positive for the disease, there is only a 19.8882% chance (roughly one in five) that they actually have the disease.
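The posterior can be verified with a short Python sketch of the same Bayes' rule computation, using the sensitivity, specificity, and prevalence stated above:

```python
# Bayes' rule: P(disease | +) = P(+ | disease) * P(disease) / P(+)
p_disease = 0.000025        # prevalence (0.0025% of the population)
sensitivity = 0.993         # P(+ | disease)
specificity = 0.9999        # P(- | no disease)

p_pos = sensitivity * p_disease + (1 - specificity) * (1 - p_disease)
posterior = sensitivity * p_disease / p_pos
print(round(100 * posterior, 4))  # -> 19.8882 (percent)
```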
Given our calculations, implementing a universal testing policy for this disease would be problematic: for over 80% of positive tests, the patient will not actually be infected with the disease. This could lead to mass hysteria and put unneeded strain on the population’s health systems.
In order to determine if investing in a green certification for a new mixed-use building in East Austin would be economically advantageous for the developer, we attempted to provide a more convincing analysis than preliminary recommendations made by the developer’s staff member.
Formulating our own independent analysis, we first narrowed down the available dataset of commercial rental properties to a subset that mimicked the environment presented at the East Austin development site.
Once we settled on the particular subset of data we sought to analyze, we began to compare the cost of green buildings to nongreen buildings in each cluster, hoping to identify a premium for green buildings that was reflected in the price of renting a green building.
After determining that a premium did in fact exist, we drafted our own recommendation regarding the profitability of “going green” on the development. In order to make such a recommendation, we needed to provide evidence for why the revenue generated from the green rent premium exceeded the costs of constructing a green building.
library(plotly)
gb = read.csv('greenbuildings.csv')
rent.days = plot_ly(data=gb, x=~total_dd_07, y=~Rent)
To create a target subset similar to the building being constructed, we needed to narrow down our dataset to the most relevant variables.
We first considered the role of utilities (total_dd_07,Gas_Costs, Electricity_Costs, net) on price. We found that reporting a high number of heating and cooling days per year did not correspond to relatively higher rents.
As a result, gas and electricity costs no longer seemed useful for evaluating the East Austin building’s rent. In some cases, utilities were even included in the building’s rent, and we did not know how the East Austin building planned to distribute the cost of utilities.
p = plot_ly(data=gb,x=~age, y=~stories)
Row Selection:
We decided to filter out buildings built over 50 years ago and buildings taller than 30 stories. This decision allowed us to capture a subset of building clusters more in line with the attributes of the proposed building in East Austin. (At the time of its opening, the new East Austin building will be 0 years old and 15 stories tall.)
gb = gb[gb$age <=50,]
gb = gb[gb$stories <=30,]
q = plot_ly(data=gb,x=~age, y=~stories)
The above plot reflects our new subset of data.
#filtering out green and non green buildings
greenbuildings = gb[gb$green_rating == 1,]
nongreen = gb[gb$green_rating==0,]
#Range of green rents and non green rents
green.rent = plot_ly(data=greenbuildings,x=~Rent)%>% layout(title = "Rent of Green Buildings", yaxis = list(title = "Frequency"))
non.green.rent = plot_ly(data=nongreen, x=~Rent)%>% layout(title = "Rent of NonGreen Buildings", yaxis = list(title = "Frequency"))
Both bar charts have somewhat wide ranges. Most of the rents fall between $0 and $80/sqft. Because of this, the on-staff statistician’s method of separating the buildings and merely taking averages wasn’t prudent. A more accurate insight could be attained by analyzing clusters of green and non-green buildings with characteristics similar to the building planned for East Austin.
# Rent of each building relative to the average rent in its cluster
greenratios = greenbuildings$Rent/greenbuildings$cluster_rent
nongreenratios = nongreen$Rent/nongreen$cluster_rent
#average
average.green = mean(greenratios)
average.nongreen = mean(nongreenratios)
# Ratio of the average ratios: the green rent premium
quotientavg = average.green/average.nongreen
## [1] 1.0675311
We determined that the premium on green rent was about 6.75%. This is larger than the 5% premium on green construction costs in the Austin area, so over the expected lifetime of the building, a green building would generate more revenue than the extra cost of construction. Therefore, we advise the real-estate developer to proceed with building a green development.
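As a rough sanity check on that conclusion, a back-of-the-envelope payback calculation can be sketched in Python. All dollar figures and building dimensions below are hypothetical placeholders, not values from the dataset; only the 6.75% rent premium and 5% construction premium come from the analysis above.

```python
# Hypothetical example: how long does a 6.75% rent premium take to
# recoup a 5% construction premium?  All figures are illustrative.
construction_cost = 100_000_000   # assumed baseline construction cost
green_extra_cost = 0.05 * construction_cost

leasable_sqft = 250_000           # assumed building size
base_rent = 25.0                  # assumed $/sqft/year
extra_rent = leasable_sqft * base_rent * 0.0675  # premium revenue per year

payback_years = green_extra_cost / extra_rent
print(round(payback_years, 1))  # -> 11.9 years under these assumptions
```

Under these illustrative numbers the extra cost is recovered well within a commercial building's expected lifetime, consistent with the recommendation.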
To understand the tradeoffs between risk and return of exchange-traded funds, we explored several years of historical data for the following ETFs:

* US domestic equities (SPY: the S&P 500 stock index)
* US Treasury bonds (TLT)
* Investment-grade corporate bonds (LQD)
* Emerging-market equities (EEM)
* Real estate (VNQ)

We considered three portfolios with varying levels of risk:

* Even Split: 20% of assets in each of the five ETFs
* Safer Split: assets invested in at least three classes
* Aggressive Split: assets consolidated into fewer classes, with the promise of higher returns at the cost of accepting greater risk

We have assumed the portfolio is rebalanced for free each day, i.e. with zero transaction costs.
#Import the stocks (quantmod for data, mosaic for resample, foreach for the simulation loop)
library(quantmod)
library(mosaic)
library(foreach)
mystocks = c("SPY", "TLT", "LQD", "EEM", "VNQ")
getSymbols(mystocks)
## Warning: LQD contains missing values. Some functions will not work if
## objects contain missing values in the middle of the series. Consider using
## na.omit(), na.approx(), na.fill(), etc to remove or replace them.
## [1] "SPY" "TLT" "LQD" "EEM" "VNQ"
#Adjust for splits and dividends
SPYa = adjustOHLC(SPY)
TLTa = adjustOHLC(TLT)
LQDa = adjustOHLC(LQD)
EEMa = adjustOHLC(EEM)
VNQa = adjustOHLC(VNQ)
SPYa.adj <- SPYa[,6]
TLTa.adj <- TLTa[,6]
LQDa.adj <- LQDa[,6]
EEMa.adj <- EEMa[,6]
VNQa.adj <- VNQa[,6]
SPYa1 = as.numeric(SPYa.adj[1])
TLTa1 = as.numeric(TLTa.adj[1])
LQDa1 = as.numeric(LQDa.adj[1])
EEMa1 = as.numeric(EEMa.adj[1])
VNQa1 = as.numeric(VNQa.adj[1])
SPYa.ts =SPYa.adj/SPYa1
TLTa.ts =TLTa.adj/TLTa1
LQDa.ts =LQDa.adj/LQDa1
EEMa.ts =EEMa.adj/EEMa1
VNQa.ts =VNQa.adj/VNQa1
basket <- cbind(SPYa.ts,TLTa.ts,LQDa.ts,EEMa.ts,VNQa.ts )
zoo.basket <- as.zoo(basket)
tsRainbow <- rainbow(ncol(zoo.basket))
plot(x = zoo.basket, ylab = "Cumulative Return", main = "Cumulative Returns",col = tsRainbow, screens = 1)
legend(x = "topleft", legend = c("SPY", "TLT", "LQD", "EEM", "VNQ"), lty = 1,col = tsRainbow)
Next we plotted the cumulative returns of each stock from 2007 through 2017-08-09.

* This allowed us to visualize each ETF’s gains/losses over time.
* Visually, LQD appears less volatile than EEM.
* Rather than relying on this graph alone, we would like a standardized way to compare the risk of investing in each ETF.
* We believe value at risk (VaR) will be a good comparison measure.
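Value at risk at the 5% level is simply a low quantile of the distribution of simulated outcomes, expressed relative to the starting wealth. A minimal Python illustration of the definition (the wealth outcomes here are synthetic, not derived from the ETF data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic final-wealth outcomes for a $100,000 portfolio (illustrative)
final_wealth = 100_000 * (1 + rng.normal(0.0, 0.04, size=5000))

# 5% VaR: the loss threshold exceeded in only 5% of simulated outcomes
var_5 = np.quantile(final_wealth, 0.05) - 100_000
print(var_5 < 0)  # -> True: the 5th-percentile outcome is a loss
```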
#############################
#########Even Split Portfolio
#############################
#Setting up a matrix of close-to-close returns for each day
all_returns = cbind(ClCl(SPYa),ClCl(TLTa),ClCl(LQDa), ClCl(EEMa), ClCl(VNQa))
all_returns = as.matrix(na.omit(all_returns))
n_days = 20 # i.e. four trading weeks
#Setting up the Even Split Portfolio and using Bootstrap to find VaR
initial.wealth = 100000
sim.even = foreach(i=1:5000, .combine = 'rbind') %do% {
  days = 20
  total.wealth = initial.wealth
  even.weight = c(0.2, 0.2, 0.2, 0.2, 0.2)
  even.holdings = even.weight*total.wealth
  even.wealthtracker = rep(0, days)
  #loop over four trading weeks, resampling one historical day at a time
  for(today in 1:days){
    return.today = resample(all_returns, 1, orig.ids = FALSE)
    even.holdings = even.holdings + even.holdings*return.today
    total.wealth = sum(even.holdings)
    even.wealthtracker[today] = total.wealth
  }
  even.wealthtracker
}
#Calculate 5% VaR for even portfolio
even.split.VaR = quantile(sim.even[,n_days], 0.05) - initial.wealth
For our Even Split portfolio, we placed $20,000 in each of the 5 ETFs.
Even Split:
* At the five percent level, the value at risk for our Even Split portfolio is about -$6,204.42.
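The bootstrap draws whole historical days with replacement, so the joint behavior of the five ETFs on any given day is preserved. The same scheme can be sketched in Python; the return matrix below is synthetic and stands in for `all_returns`, and the sketch follows the stated zero-cost daily-rebalancing assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for the historical return matrix (rows = days, cols = 5 ETFs)
all_returns = rng.normal(0.0003, 0.01, size=(2500, 5))

def simulate_path(days=20, wealth=100_000, weights=None):
    """Rebalance daily to fixed weights, drawing whole days with replacement."""
    weights = np.full(5, 0.2) if weights is None else weights
    for _ in range(days):
        day = all_returns[rng.integers(len(all_returns))]  # bootstrap one day
        wealth = np.sum(weights * wealth * (1 + day))      # grow and rebalance
    return wealth

finals = np.array([simulate_path() for _ in range(2000)])
var_5 = np.quantile(finals, 0.05) - 100_000  # 5% value at risk
```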
##################
###### Safer Split
##################
safe_returns = cbind(ClCl(SPYa),ClCl(TLTa),ClCl(LQDa))
safe_returns = as.matrix(na.omit(safe_returns))
#Setting up the Safe Split Portfolio and using Bootstrap to find VaR
initial.wealth = 100000
sim.safe = foreach(i=1:5000, .combine = 'rbind') %do% {
  days = 20
  total.wealth = initial.wealth
  safe.weight = c(1/3, 1/3, 1/3)
  safe.holdings = safe.weight*total.wealth
  safe.wealthtracker = rep(0, days)
  #loop over four trading weeks, resampling one historical day at a time
  for(today in 1:days){
    return.today = resample(safe_returns, 1, orig.ids = FALSE)
    safe.holdings = safe.holdings + safe.holdings*return.today
    total.wealth = sum(safe.holdings)
    safe.wealthtracker[today] = total.wealth
  }
  safe.wealthtracker
}
#Calculate 5% VaR for safe portfolio
safe.split.VaR = quantile(sim.safe[,n_days], 0.05) - initial.wealth
For the Safe Split, we limited our portfolio to three ETFs: the S&P 500, Treasury bonds, and corporate bonds. Treasury bonds are a safe investment because they are guaranteed by the federal government and are virtually risk-free. Corporate bonds are relatively safe because they pay out unless the company defaults. And the S&P 500 is a diversified sample of the overall stock market, so it is not overly influenced by sector fluctuations.
Safe Split:
* At the five percent level, the value at risk for our Safe Split portfolio is about -$3,093.19.
########################
####Aggressive Portfolio
########################
aggressive_returns = cbind(ClCl(EEMa),ClCl(VNQa))
aggressive_returns = as.matrix(na.omit(aggressive_returns))
#Setting up the Aggressive Split Portfolio and using Bootstrap to find VaR
initial.wealth = 100000
sim.aggressive = foreach(i=1:5000, .combine = 'rbind') %do% {
  days = 20
  total.wealth = initial.wealth
  aggressive.weight = c(1/2, 1/2)
  aggressive.holdings = aggressive.weight*total.wealth
  aggressive.wealthtracker = rep(0, days)
  #loop over four trading weeks, resampling one historical day at a time
  for(today in 1:days){
    return.today = resample(aggressive_returns, 1, orig.ids = FALSE)
    aggressive.holdings = aggressive.holdings + aggressive.holdings*return.today
    total.wealth = sum(aggressive.holdings)
    aggressive.wealthtracker[today] = total.wealth
  }
  aggressive.wealthtracker
}
#Calculate 5% VaR for aggressive portfolio
aggressive.split.VaR = quantile(sim.aggressive[,n_days], 0.05) - initial.wealth
For the Aggressive Split, we shrank our portfolio to two ETFs: real estate and emerging-market equities. Real estate is a risky investment, as one could see from the 2008 housing bubble and the foreclosures caused by the sub-prime mortgage crisis, but it can grow quickly based on speculation. Emerging-market equities have a high chance of failing due to geopolitics and various other factors, but they offer high growth opportunities.
Aggressive Split:
* At the five percent level, the value at risk for our Aggressive Split portfolio is about -$13,530.49.
Comparison:
* Even Split: -$6,204.42
* Safe Split: -$3,093.19
* Aggressive Split: -$13,530.49
The Safe Split cut the value at risk by about 50% relative to the Even Split, while the Aggressive Split increased it by about 118%. These results are consistent with the intent behind each portfolio's construction. Thus, depending on an investor’s time horizon and specific needs, each portfolio could be the more desirable choice.
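The relative comparison follows directly from the three simulated VaR figures; a quick Python check of the percentage changes:

```python
# 5% VaR figures from the three bootstrap simulations above
even, safe, aggressive = -6204.42, -3093.19, -13530.49

safe_change = (safe - even) / even            # change in magnitude of risk
aggressive_change = (aggressive - even) / even
print(round(100 * safe_change))        # -> -50 (about 50% less value at risk)
print(round(100 * aggressive_change))  # -> 118 (about 118% more value at risk)
```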
library(LICORS)   # provides kmeanspp
library(lattice)  # provides barchart
social_marketing = read.csv('social_marketing.csv')
smm = social_marketing[ , -which(names(social_marketing) %in% c("chatter","spam", 'uncategorized'))]
sm <- smm[,-1]/rowSums(smm[,-1], na.rm = FALSE)
sm_scaled <- scale(sm, center=TRUE, scale=TRUE)
To start, we found the proportion of each follower’s tweets per category over their total number of tweets.
Next, we determined that there was likely no realistic way to use the chatter and uncategorized categories to gain insight into NutrientH2O’s followers, other than each follower’s overall propensity for tweeting. Because we had already decided to scale each category by the total number of tweets, we chose to drop these categories.
Similarly, it did not seem feasible to build a marketing campaign around spam tweets.

* No account had more than 2 tweets categorized as spam, and the highest share of any account’s tweets categorized as spam was 7%.
* Thus, for the simplicity of the model, we chose to drop the spam category.
* While it is probably not a strong branding idea for NutrientH2O to develop a marketing campaign based on adult themes, we felt that leaving the adult category in the data made sense, as it had the possibility to capture some demographic information about NutrientH2O followers.
* Although potentially redundant, our final step was to center and scale our data to ensure that our k-means centers would be interpretable.
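The preprocessing pipeline described above (drop the unhelpful columns, convert each follower's counts to row proportions, then center and scale each column) can be sketched in numpy; the count matrix below is a hypothetical toy, not the tweet data:

```python
import numpy as np

# Toy tweet-count matrix: rows = followers, cols = remaining categories
counts = np.array([[4, 1, 5],
                   [0, 2, 6],
                   [3, 3, 6]], dtype=float)

# 1. Each follower's counts become proportions of their total tweets
props = counts / counts.sum(axis=1, keepdims=True)

# 2. Center and scale each category (ddof=1 matches R's scale()) so that
#    k-means treats all categories on a comparable footing
scaled = (props - props.mean(axis=0)) / props.std(axis=0, ddof=1)
print(np.allclose(scaled.mean(axis=0), 0))  # -> True: columns are centered
```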
Our goal is to group the categories of tweets into an optimal K-number of clusters.
We first experimented with finding the number of clusters with the highest CH(k) score when running kmeans++. From the graph, the highest score came when the data was grouped into 3 clusters.
ch.index = function(x, kmax, iter.max=100, nstart=10, algorithm="Lloyd") {
  ch = numeric(length = kmax-1)
  n = nrow(x)
  for (k in 2:kmax) {
    a = kmeanspp(x, k, iter.max=iter.max, nstart=nstart, algorithm=algorithm)
    w = a$tot.withinss  # within-cluster sum of squares
    b = a$betweenss     # between-cluster sum of squares
    ch[k-1] = (b/(k-1))/(w/(n-k))
  }
  return(list(k=2:kmax, ch=ch))
}
a1 = ch.index(sm_scaled, 10)
k1.hat = a1$k[which.max(a1$ch)]
plot(a1$k,a1$ch, xlab='K', ylab='CH(K)', type='b', main='K-Means Clustering : CH Index vs K' )
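The statistic being maximized is CH(k) = [B/(k-1)] / [W/(n-k)], where B is the between-cluster and W the within-cluster sum of squares. A minimal numpy version of the same computation, run on hypothetical toy blobs rather than the tweet data:

```python
import numpy as np

def ch_index(x, labels):
    """Calinski-Harabasz score: between/within variance ratio, df-adjusted."""
    n, k = len(x), len(np.unique(labels))
    grand = x.mean(axis=0)
    w = b = 0.0
    for c in np.unique(labels):
        pts = x[labels == c]
        center = pts.mean(axis=0)
        w += np.sum((pts - center) ** 2)               # within-cluster SS
        b += len(pts) * np.sum((center - grand) ** 2)  # between-cluster SS
    return (b / (k - 1)) / (w / (n - k))

# Two well-separated toy blobs: CH should strongly favor this 2-way split
rng = np.random.default_rng(2)
x = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
labels = np.repeat([0, 1], 50)
print(ch_index(x, labels) > 100)  # -> True: clear separation, huge CH
```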
Next we ran k-means++ with 3 clusters. k-means++ improves on standard k-means by choosing well-spread initial centers, which typically leads to faster convergence and better clusterings than random initialization.
maxCH = kmeanspp(sm_scaled, 3,iter.max=100,nstart=10,algorithm="Lloyd")
In order to make inferences about the interests of NutrientH2O’s followers for targeted marketing and branding, we examined each category’s scaled center value within each cluster. We then sorted the center values and created bar charts for each of the clusters.
ch_cluster_1 = sort(maxCH$centers[1,])
ch_cluster_2 = sort(maxCH$centers[2,])
ch_cluster_3 = sort(maxCH$centers[3,])
par(mfrow=c(1,3))
barchart(ch_cluster_2)
barchart(ch_cluster_3)
barchart(ch_cluster_1)
Inferences:
* Something about these three clusters doesn’t feel right.
* The most noticeable attribute is that the second cluster has no distinct groupings on the high or low end.
* Intuition says religion
Continuing our exploration of the optimal number of clusters for k-means, we implemented the Elbow Method. The Elbow Method requires running k-means or k-means++ for multiple values of k and plotting each run’s sum of squared errors (SSE). A visual inspection of the resulting line chart is then performed: if there is a noticeable change in the direction of the line, that bend (i.e. the elbow) marks the ideal number of clusters.
# A plain vector avoids the out-of-bounds matrix assignment at i = 10
wss <- rep(NA, 10)
for (i in 2:10) wss[i] <- sum(kmeanspp(sm_scaled, i, iter.max=100, nstart=10, algorithm="Lloyd")$withinss)
plot(2:10, wss[2:10], type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares",
     main="Assessing the Optimal Number of Clusters with the Elbow Method",
     pch=20, cex=2)
From the graph, there appears, ever so slightly, to be an elbow at 6 clusters.
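The elbow heuristic just tracks how the total within-cluster SSE falls as k grows. A compact numpy sketch with a tiny Lloyd's-algorithm k-means on hypothetical toy data (deterministic quantile-based initialization so the result is reproducible):

```python
import numpy as np

def kmeans_wss(x, k, iters=50):
    """Lloyd's algorithm; returns total within-cluster sum of squares."""
    # Deterministic init: spread centers across the data's quantiles
    centers = np.quantile(x, np.linspace(0, 1, k), axis=0)
    for _ in range(iters):
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)            # assign to nearest center
        for c in range(k):
            if np.any(labels == c):
                centers[c] = x[labels == c].mean(axis=0)  # update step
    return sum(((x[labels == c] - centers[c]) ** 2).sum() for c in range(k))

# Two clear blobs: WSS should drop sharply from k=1 to k=2, then flatten
rng = np.random.default_rng(3)
x = np.vstack([rng.normal(0, 0.2, (40, 2)), rng.normal(4, 0.2, (40, 2))])
wss = [kmeans_wss(x, k) for k in (1, 2, 3)]
print(wss[1] < 0.1 * wss[0])  # -> True: the elbow sits at k=2
```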
maxElbow = kmeanspp(sm_scaled, 6,iter.max=100,nstart=10,algorithm="Lloyd")
elbow_cluster_1 = sort(maxElbow$centers[1,])
elbow_cluster_2 = sort(maxElbow$centers[2,])
elbow_cluster_3 = sort(maxElbow$centers[3,])
elbow_cluster_4 = sort(maxElbow$centers[4,])
elbow_cluster_5 = sort(maxElbow$centers[5,])
elbow_cluster_6 = sort(maxElbow$centers[6,])
par(mfrow=c(3,2))
#Clusters plotted in order of decreasing withinss
barchart(elbow_cluster_6)
barchart(elbow_cluster_2)
barchart(elbow_cluster_1)
barchart(elbow_cluster_3)
barchart(elbow_cluster_4)
barchart(elbow_cluster_5)